Dataset statistics
| Number of variables | 9 |
|---|---|
| Number of observations | 409 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 28.9 KiB |
| Average record size in memory | 72.3 B |
Variable types
| Numeric | 7 |
|---|---|
| Categorical | 2 |
| ChEMBL ID has a high cardinality: 409 distinct values | High cardinality |
|---|---|
| Smiles has a high cardinality: 409 distinct values | High cardinality |
| HBA is highly correlated with HBD and 1 other fields | High correlation |
| HBD is highly correlated with HBA and 1 other fields | High correlation |
| PSA is highly correlated with HBA and 1 other fields | High correlation |
| ChEMBL ID is uniformly distributed | Uniform |
| Smiles is uniformly distributed | Uniform |
| df_index has unique values | Unique |
| ChEMBL ID has unique values | Unique |
| Smiles has unique values | Unique |
| HBD has 54 (13.2%) zeros | Zeros |
Reproduction
| Analysis started | 2022-01-15 12:13:52.875712 |
|---|---|
| Analysis finished | 2022-01-15 12:18:00.161833 |
| Duration | 4 minutes and 7.29 seconds |
| Software version | pandas-profiling v3.0.0 |
df_index
| Distinct | 409 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 470.0757946 |
| Minimum | 2 |
| Maximum | 988 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | 2 |
|---|---|
| 5-th percentile | 43.2 |
| Q1 | 195 |
| median | 442 |
| Q3 | 748 |
| 95-th percentile | 946.2 |
| Maximum | 988 |
| Range | 986 |
| Interquartile range (IQR) | 553 |
Descriptive statistics
| Standard deviation | 301.3298574 |
|---|---|
| Coefficient of variation (CV) | 0.6410239814 |
| Kurtosis | -1.307572731 |
| Mean | 470.0757946 |
| Median Absolute Deviation (MAD) | 278 |
| Skewness | 0.1500746574 |
| Sum | 192261 |
| Variance | 90799.68297 |
| Monotonicity | Strictly increasing |
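The descriptive statistics above are linked by a few simple identities: the mean is the sum divided by the number of observations, the variance is the square of the standard deviation, and the coefficient of variation is the standard deviation divided by the mean. The sketch below checks those identities for df_index, using only values copied from this report (409 observations); the variable names are illustrative.

```python
import math

# Values copied from the df_index tables in this report.
n_observations = 409
total = 192261
std_dev = 301.3298574

mean = total / n_observations   # Mean = Sum / number of observations
variance = std_dev ** 2         # Variance = (standard deviation)^2
cv = std_dev / mean             # Coefficient of variation = std / mean

print(f"mean     = {mean:.7f}")
print(f"variance = {variance:.5f}")
print(f"CV       = {cv:.10f}")

# Compare against the figures printed in the report (small tolerances
# absorb the rounding of the reported standard deviation).
assert math.isclose(mean, 470.0757946, abs_tol=1e-6)
assert math.isclose(variance, 90799.68297, abs_tol=1e-3)
assert math.isclose(cv, 0.6410239814, abs_tol=1e-6)
```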
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 2 | 1 | 0.2% |
| 623 | 1 | 0.2% |
| 656 | 1 | 0.2% |
| 649 | 1 | 0.2% |
| 647 | 1 | 0.2% |
| 641 | 1 | 0.2% |
| 637 | 1 | 0.2% |
| 633 | 1 | 0.2% |
| 630 | 1 | 0.2% |
| 627 | 1 | 0.2% |
| Other values (399) | 399 | 97.6% |
| Value | Count | Frequency (%) |
| 2 | 1 | 0.2% |
| 4 | 1 | 0.2% |
| 7 | 1 | 0.2% |
| 8 | 1 | 0.2% |
| 9 | 1 | 0.2% |
| 10 | 1 | 0.2% |
| 11 | 1 | 0.2% |
| 12 | 1 | 0.2% |
| 14 | 1 | 0.2% |
| 15 | 1 | 0.2% |
| Value | Count | Frequency (%) |
| 988 | 1 | 0.2% |
| 983 | 1 | 0.2% |
| 980 | 1 | 0.2% |
| 979 | 1 | 0.2% |
| 977 | 1 | 0.2% |
| 975 | 1 | 0.2% |
| 973 | 1 | 0.2% |
| 972 | 1 | 0.2% |
| 971 | 1 | 0.2% |
| 969 | 1 | 0.2% |
ChEMBL ID
| Distinct | 409 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.3 KiB |
| CHEMBL67391 | 1 |
|---|---|
| CHEMBL3818875 | 1 |
| CHEMBL1256362 | 1 |
| CHEMBL3753077 | 1 |
| CHEMBL1672002 | 1 |
| Other values (404) | 404 |
Length
| Max length | 13 |
|---|---|
| Median length | 12 |
| Mean length | 12.36919315 |
| Min length | 9 |
Characters and Unicode
| Total characters | 5059 |
|---|---|
| Distinct characters | 16 |
| Distinct categories | 2 |
| Distinct scripts | 2 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
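As a rough illustration of how such character properties can be read, the sketch below counts Unicode general categories with Python's standard `unicodedata` module. The helper name `category_counts` is hypothetical, and this is not necessarily how the report itself was computed. The two-letter codes `Lu` and `Nd` correspond to the "Uppercase Letter" and "Decimal Number" categories listed in the tables.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count characters in `text` by Unicode general category code."""
    return Counter(unicodedata.category(ch) for ch in text)


# First sample value of the ChEMBL ID column.
counts = category_counts("CHEMBL394875")
for code, count in sorted(counts.items()):
    print(code, count)
```

Scripts and blocks are not exposed by `unicodedata`, so only the category breakdown is sketched here.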
Unique
| Unique | 409 |
|---|---|
| Unique (%) | 100.0% |
Sample
| 1st row | CHEMBL394875 |
|---|---|
| 2nd row | CHEMBL200381 |
| 3rd row | CHEMBL502351 |
| 4th row | CHEMBL492572 |
| 5th row | CHEMBL492591 |
Common Values
| Value | Count | Frequency (%) |
| CHEMBL67391 | 1 | 0.2% |
| CHEMBL3818875 | 1 | 0.2% |
| CHEMBL1256362 | 1 | 0.2% |
| CHEMBL3753077 | 1 | 0.2% |
| CHEMBL1672002 | 1 | 0.2% |
| CHEMBL473159 | 1 | 0.2% |
| CHEMBL3287735 | 1 | 0.2% |
| CHEMBL4224714 | 1 | 0.2% |
| CHEMBL2430359 | 1 | 0.2% |
| CHEMBL375270 | 1 | 0.2% |
| Other values (399) | 399 | 97.6% |
Length
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| chembl567303 | 1 | 0.2% |
| chembl600764 | 1 | 0.2% |
| chembl91485 | 1 | 0.2% |
| chembl470334 | 1 | 0.2% |
| chembl464552 | 1 | 0.2% |
| chembl2178284 | 1 | 0.2% |
| chembl304087 | 1 | 0.2% |
| chembl2424812 | 1 | 0.2% |
| chembl2029422 | 1 | 0.2% |
| chembl1917204 | 1 | 0.2% |
| Other values (399) | 399 | 97.6% |
Most occurring characters
| Value | Count | Frequency (%) |
| C | 409 | 8.1% |
| H | 409 | 8.1% |
| E | 409 | 8.1% |
| M | 409 | 8.1% |
| B | 409 | 8.1% |
| L | 409 | 8.1% |
| 2 | 350 | 6.9% |
| 3 | 321 | 6.3% |
| 1 | 313 | 6.2% |
| 4 | 265 | 5.2% |
| Other values (6) | 1356 | 26.8% |
Most occurring categories
| Value | Count | Frequency (%) |
| Decimal Number | 2605 | 51.5% |
| Uppercase Letter | 2454 | 48.5% |
Most frequent character per category
Decimal Number
| Value | Count | Frequency (%) |
| 2 | 350 | 13.4% |
| 3 | 321 | 12.3% |
| 1 | 313 | 12.0% |
| 4 | 265 | 10.2% |
| 0 | 258 | 9.9% |
| 5 | 249 | 9.6% |
| 9 | 224 | 8.6% |
| 7 | 224 | 8.6% |
| 6 | 217 | 8.3% |
| 8 | 184 | 7.1% |
Uppercase Letter
| Value | Count | Frequency (%) |
| C | 409 | 16.7% |
| H | 409 | 16.7% |
| E | 409 | 16.7% |
| M | 409 | 16.7% |
| B | 409 | 16.7% |
| L | 409 | 16.7% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Common | 2605 | 51.5% |
| Latin | 2454 | 48.5% |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
| 2 | 350 | 13.4% |
| 3 | 321 | 12.3% |
| 1 | 313 | 12.0% |
| 4 | 265 | 10.2% |
| 0 | 258 | 9.9% |
| 5 | 249 | 9.6% |
| 9 | 224 | 8.6% |
| 7 | 224 | 8.6% |
| 6 | 217 | 8.3% |
| 8 | 184 | 7.1% |
Latin
| Value | Count | Frequency (%) |
| C | 409 | 16.7% |
| H | 409 | 16.7% |
| E | 409 | 16.7% |
| M | 409 | 16.7% |
| B | 409 | 16.7% |
| L | 409 | 16.7% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 5059 | 100.0% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| C | 409 | 8.1% |
| H | 409 | 8.1% |
| E | 409 | 8.1% |
| M | 409 | 8.1% |
| B | 409 | 8.1% |
| L | 409 | 8.1% |
| 2 | 350 | 6.9% |
| 3 | 321 | 6.3% |
| 1 | 313 | 6.2% |
| 4 | 265 | 5.2% |
| Other values (6) | 1356 | 26.8% |
AlogP
Real number (ℝ)
| Distinct | 310 |
|---|---|
| Distinct (%) | 75.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 3.52405868 |
| Minimum | -2.32 |
| Maximum | 9.01 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 20 |
| Negative (%) | 4.9% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | -2.32 |
|---|---|
| 5-th percentile | 0.146 |
| Q1 | 2.6 |
| median | 3.65 |
| Q3 | 4.72 |
| 95-th percentile | 6.304 |
| Maximum | 9.01 |
| Range | 11.33 |
| Interquartile range (IQR) | 2.12 |
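The last two rows of the quantile table follow directly from the others: the range is the maximum minus the minimum, and the interquartile range is Q3 minus Q1. A small check with the AlogP values copied from the table above (variable names are illustrative):

```python
# Quantile values copied from the AlogP table in this report.
minimum, q1, q3, maximum = -2.32, 2.6, 4.72, 9.01

value_range = round(maximum - minimum, 2)  # Range = Maximum - Minimum
iqr = round(q3 - q1, 2)                    # IQR = Q3 - Q1

print("range:", value_range)
print("IQR:", iqr)
```

Rounding to two decimals avoids floating-point noise in the subtraction.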
Descriptive statistics
| Standard deviation | 1.833432615 |
|---|---|
| Coefficient of variation (CV) | 0.5202616589 |
| Kurtosis | 0.713532252 |
| Mean | 3.52405868 |
| Median Absolute Deviation (MAD) | 1.07 |
| Skewness | -0.4876363405 |
| Sum | 1441.34 |
| Variance | 3.361475153 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 3.19 | 5 | 1.2% |
| 3.12 | 4 | 1.0% |
| 2.73 | 4 | 1.0% |
| 3.47 | 4 | 1.0% |
| 4.76 | 3 | 0.7% |
| 5.18 | 3 | 0.7% |
| 5.1 | 3 | 0.7% |
| 4.26 | 3 | 0.7% |
| 2.88 | 3 | 0.7% |
| 4.23 | 3 | 0.7% |
| Other values (300) | 374 | 91.4% |
| Value | Count | Frequency (%) |
| -2.32 | 1 | 0.2% |
| -2.3 | 1 | 0.2% |
| -2.06 | 1 | 0.2% |
| -1.94 | 1 | 0.2% |
| -1.77 | 1 | 0.2% |
| -1.5 | 1 | 0.2% |
| -1.39 | 1 | 0.2% |
| -1.38 | 1 | 0.2% |
| -1.22 | 1 | 0.2% |
| -1.08 | 1 | 0.2% |
| Value | Count | Frequency (%) |
| 9.01 | 1 | 0.2% |
| 8.37 | 1 | 0.2% |
| 7.46 | 1 | 0.2% |
| 7.36 | 1 | 0.2% |
| 7.29 | 1 | 0.2% |
| 7.28 | 1 | 0.2% |
| 7.09 | 1 | 0.2% |
| 7.04 | 1 | 0.2% |
| 7 | 1 | 0.2% |
| 6.99 | 1 | 0.2% |
PSA
| Distinct | 363 |
|---|---|
| Distinct (%) | 88.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 76.95643032 |
| Minimum | 0 |
| Maximum | 298.14 |
| Zeros | 2 |
| Zeros (%) | 0.5% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 25.504 |
| Q1 | 52.31 |
| median | 74.85 |
| Q3 | 96.35 |
| 95-th percentile | 135.71 |
| Maximum | 298.14 |
| Range | 298.14 |
| Interquartile range (IQR) | 44.04 |
Descriptive statistics
| Standard deviation | 36.10628212 |
|---|---|
| Coefficient of variation (CV) | 0.4691782346 |
| Kurtosis | 4.15544506 |
| Mean | 76.95643032 |
| Median Absolute Deviation (MAD) | 22.2 |
| Skewness | 1.165089748 |
| Sum | 31475.18 |
| Variance | 1303.663608 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 23.47 | 4 | 1.0% |
| 58.2 | 4 | 1.0% |
| 12.47 | 3 | 0.7% |
| 49.33 | 3 | 0.7% |
| 23.55 | 3 | 0.7% |
| 74.85 | 3 | 0.7% |
| 37.3 | 3 | 0.7% |
| 64.35 | 3 | 0.7% |
| 50.94 | 2 | 0.5% |
| 40.46 | 2 | 0.5% |
| Other values (353) | 379 | 92.7% |
| Value | Count | Frequency (%) |
| 0 | 2 | 0.5% |
| 9.72 | 1 | 0.2% |
| 12.03 | 2 | 0.5% |
| 12.47 | 3 | 0.7% |
| 15.27 | 1 | 0.2% |
| 15.71 | 2 | 0.5% |
| 20.23 | 1 | 0.2% |
| 21.26 | 1 | 0.2% |
| 23.47 | 4 | 1.0% |
| 23.55 | 3 | 0.7% |
| Value | Count | Frequency (%) |
| 298.14 | 1 | 0.2% |
| 218.99 | 1 | 0.2% |
| 218.8 | 1 | 0.2% |
| 208.65 | 1 | 0.2% |
| 191.6 | 1 | 0.2% |
| 178.3 | 1 | 0.2% |
| 169.26 | 1 | 0.2% |
| 166.75 | 1 | 0.2% |
| 164.36 | 1 | 0.2% |
| 159.51 | 1 | 0.2% |
HBA
| Distinct | 15 |
|---|---|
| Distinct (%) | 3.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 4.946210269 |
| Minimum | 0 |
| Maximum | 18 |
| Zeros | 2 |
| Zeros (%) | 0.5% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 2 |
| Q1 | 3 |
| median | 5 |
| Q3 | 6 |
| 95-th percentile | 9 |
| Maximum | 18 |
| Range | 18 |
| Interquartile range (IQR) | 3 |
Descriptive statistics
| Standard deviation | 2.276702458 |
|---|---|
| Coefficient of variation (CV) | 0.4602922913 |
| Kurtosis | 3.195541801 |
| Mean | 4.946210269 |
| Median Absolute Deviation (MAD) | 2 |
| Skewness | 0.9620279714 |
| Sum | 2023 |
| Variance | 5.183374083 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=15)
| Value | Count | Frequency (%) |
| 5 | 72 | 17.6% |
| 6 | 64 | 15.6% |
| 3 | 62 | 15.2% |
| 4 | 62 | 15.2% |
| 2 | 47 | 11.5% |
| 7 | 41 | 10.0% |
| 8 | 29 | 7.1% |
| 9 | 15 | 3.7% |
| 1 | 8 | 2.0% |
| 0 | 2 | 0.5% |
| Other values (5) | 7 | 1.7% |
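The Frequency (%) column in these tables is each count divided by the number of observations (409 for this dataset), shown to one decimal place. A minimal sketch of that arithmetic follows; the helper name `frequency_percent` is hypothetical, and the one-decimal rounding is an assumption inferred from the printed values.

```python
N_OBSERVATIONS = 409  # "Number of observations" from the dataset statistics


def frequency_percent(count, total=N_OBSERVATIONS):
    """Share of `total` represented by `count`, as a percentage (1 decimal)."""
    return round(100 * count / total, 1)


# Counts taken from the HBA table above.
for value, count in [(9, 15), (1, 8), (0, 2)]:
    print(f"HBA={value}: {count} rows -> {frequency_percent(count)}%")
```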
| Value | Count | Frequency (%) |
| 0 | 2 | 0.5% |
| 1 | 8 | 2.0% |
| 2 | 47 | 11.5% |
| 3 | 62 | 15.2% |
| 4 | 62 | 15.2% |
| 5 | 72 | 17.6% |
| 6 | 64 | 15.6% |
| 7 | 41 | 10.0% |
| 8 | 29 | 7.1% |
| 9 | 15 | 3.7% |
| Value | Count | Frequency (%) |
| 18 | 1 | 0.2% |
| 16 | 1 | 0.2% |
| 13 | 1 | 0.2% |
| 11 | 2 | 0.5% |
| 10 | 2 | 0.5% |
| 9 | 15 | 3.7% |
| 8 | 29 | 7.1% |
| 7 | 41 | 10.0% |
| 6 | 64 | 15.6% |
| 5 | 72 | 17.6% |
HBD
| Distinct | 11 |
|---|---|
| Distinct (%) | 2.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 1.733496333 |
| Minimum | 0 |
| Maximum | 12 |
| Zeros | 54 |
| Zeros (%) | 13.2% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 1 |
| median | 2 |
| Q3 | 2 |
| 95-th percentile | 4 |
| Maximum | 12 |
| Range | 12 |
| Interquartile range (IQR) | 1 |
Descriptive statistics
| Standard deviation | 1.370157337 |
|---|---|
| Coefficient of variation (CV) | 0.790401059 |
| Kurtosis | 10.27217497 |
| Mean | 1.733496333 |
| Median Absolute Deviation (MAD) | 1 |
| Skewness | 2.144057009 |
| Sum | 709 |
| Variance | 1.877331128 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=11)
| Value | Count | Frequency (%) |
| 1 | 149 | 36.4% |
| 2 | 118 | 28.9% |
| 3 | 60 | 14.7% |
| 0 | 54 | 13.2% |
| 4 | 14 | 3.4% |
| 5 | 8 | 2.0% |
| 6 | 2 | 0.5% |
| 7 | 1 | 0.2% |
| 8 | 1 | 0.2% |
| 9 | 1 | 0.2% |
| Value | Count | Frequency (%) |
| 0 | 54 | 13.2% |
| 1 | 149 | 36.4% |
| 2 | 118 | 28.9% |
| 3 | 60 | 14.7% |
| 4 | 14 | 3.4% |
| 5 | 8 | 2.0% |
| 6 | 2 | 0.5% |
| 7 | 1 | 0.2% |
| 8 | 1 | 0.2% |
| 9 | 1 | 0.2% |
| Value | Count | Frequency (%) |
| 12 | 1 | 0.2% |
| 9 | 1 | 0.2% |
| 8 | 1 | 0.2% |
| 7 | 1 | 0.2% |
| 6 | 2 | 0.5% |
| 5 | 8 | 2.0% |
| 4 | 14 | 3.4% |
| 3 | 60 | 14.7% |
| 2 | 118 | 28.9% |
| 1 | 149 | 36.4% |
Smiles
| Distinct | 409 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.3 KiB |
| O=C(OC(C(F)(F)F)C(F)(F)F)N1CCN(Cc2cccc(Oc3ccccc3)c2)CC1 | 1 |
|---|---|
| COc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OC.Cl | 1 |
| CC(C)(C)c1cc(C(=O)/C(C#N)=N/Nc2cccc(Cl)c2)no1 | 1 |
| CCCCCCCCCCCCCCCCNc1ccc(C(=O)O)cc1 | 1 |
| C=C[C@]1(C)CC[C@@H](C(=C)C)C[C@H]1C(=C)C | 1 |
| Other values (404) | 404 |
Length
| Max length | 223 |
|---|---|
| Median length | 49 |
| Mean length | 51.08557457 |
| Min length | 11 |
Characters and Unicode
| Total characters | 20894 |
|---|---|
| Distinct characters | 33 |
| Distinct categories | 8 |
| Distinct scripts | 2 |
| Distinct blocks | 1 |
Unique
| Unique | 409 |
|---|---|
| Unique (%) | 100.0% |
Sample
| 1st row | CSC[C@H](N)C(=O)O |
|---|---|
| 2nd row | N#Cc1cnc2cnc(NCc3cccnc3)cc2c1Nc1ccc(F)c(Cl)c1 |
| 3rd row | COc1ccc(-c2cnc3c(-c4cccc5ncccc45)cnn3c2)cc1 |
| 4th row | C[C@@H](CN1CCC(n2c(=O)[nH]c3cc(Cl)ccc32)CC1)NC(=O)c1ccc2ccccc2c1 |
| 5th row | O=C(O)c1cc2occc2[nH]1 |
Common Values
| Value | Count | Frequency (%) |
| O=C(OC(C(F)(F)F)C(F)(F)F)N1CCN(Cc2cccc(Oc3ccccc3)c2)CC1 | 1 | 0.2% |
| COc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OC.Cl | 1 | 0.2% |
| CC(C)(C)c1cc(C(=O)/C(C#N)=N/Nc2cccc(Cl)c2)no1 | 1 | 0.2% |
| CCCCCCCCCCCCCCCCNc1ccc(C(=O)O)cc1 | 1 | 0.2% |
| C=C[C@]1(C)CC[C@@H](C(=C)C)C[C@H]1C(=C)C | 1 | 0.2% |
| COc1ccccc1C1(O)CCN(CCCn2c3ccccc3c3ccccc32)CC1 | 1 | 0.2% |
| O=C(O)C1CN(Cc2ccc(-c3nc4cc(Cc5ccccc5F)ccc4s3)c(F)c2)C1 | 1 | 0.2% |
| CN(C)Cc1cccc(N/C(=C2\C(=O)Nc3cc(C(=O)N(C)C)ccc32)c2ccccc2)c1 | 1 | 0.2% |
| CC(C)Cc1ccc([C@@H](C)C(=O)O)cc1 | 1 | 0.2% |
| CC(C)c1cc(-c2ccc(F)c3ccccc23)nc(N)n1 | 1 | 0.2% |
| Other values (399) | 399 | 97.6% |
Length
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| cc(=o)ncc(=o)n1[c@h]2cc[c@@h]1c1cc(nc3ncc(c(f)(f)f)c(nc4ccc4)n3)ccc12 | 1 | 0.2% |
| ccnc(=o)c1cc2c(-c3cc(c(c)(c)o)ccc3oc3c(c)cc(f)cc3c)cn(c)c(=o)c2[nh]1 | 1 | 0.2% |
| cc(c)c1ccc(-c2cc(=o)c3ccccc3o2)cc1 | 1 | 0.2% |
| o=c(o)ccn(o)c(=o)ccccccccc1cc1 | 1 | 0.2% |
| coc1ccc(ccn(c)ccc2ccc(oc)c(oc)c2)cc1oc.cl | 1 | 0.2% |
| coc1ccc(-c2coc3cc(o)cc(o)c3c2=o)cc1 | 1 | 0.2% |
| o=c(o)/c=c\c(=o)o.o=c(c1ccc(occcc2c[nh]cn2)cc1)c1cc1 | 1 | 0.2% |
| ccoc1ccccc1-n1c(c(c)n2ccn(c(=o)coc3ccc(cl)cc3)cc2)nc2ccccc2c1=o | 1 | 0.2% |
| o=c1cc(n2ccocc2)oc2c(-c3ccccc3)cccc12 | 1 | 0.2% |
| oc[c@h]1o[c@@h](oc2cc3c(o)cc(o)cc3[o+]c2-c2ccc(o)c(o)c2)[c@h](o)[c@@h](o)[c@@h]1o | 1 | 0.2% |
| Other values (399) | 399 | 97.6% |
Most occurring characters
| Value | Count | Frequency (%) |
| c | 5048 | 24.2% |
| C | 3392 | 16.2% |
| ( | 1958 | 9.4% |
| ) | 1958 | 9.4% |
| 1 | 1180 | 5.6% |
| O | 1082 | 5.2% |
| 2 | 968 | 4.6% |
| N | 727 | 3.5% |
| n | 631 | 3.0% |
| = | 598 | 2.9% |
| Other values (23) | 3352 | 16.0% |
Most occurring categories
| Value | Count | Frequency (%) |
| Uppercase Letter | 5913 | 28.3% |
| Lowercase Letter | 5895 | 28.2% |
| Decimal Number | 2838 | 13.6% |
| Open Punctuation | 2405 | 11.5% |
| Close Punctuation | 2405 | 11.5% |
| Other Punctuation | 646 | 3.1% |
| Math Symbol | 617 | 3.0% |
| Dash Punctuation | 175 | 0.8% |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
| C | 3392 | 57.4% |
| O | 1082 | 18.3% |
| N | 727 | 12.3% |
| H | 377 | 6.4% |
| F | 244 | 4.1% |
| S | 74 | 1.3% |
| B | 16 | 0.3% |
| I | 1 | < 0.1% |
Lowercase Letter
| Value | Count | Frequency (%) |
| c | 5048 | 85.6% |
| n | 631 | 10.7% |
| l | 129 | 2.2% |
| s | 40 | 0.7% |
| o | 29 | 0.5% |
| r | 16 | 0.3% |
| a | 2 | < 0.1% |
Decimal Number
| Value | Count | Frequency (%) |
| 1 | 1180 | 41.6% |
| 2 | 968 | 34.1% |
| 3 | 474 | 16.7% |
| 4 | 166 | 5.8% |
| 5 | 46 | 1.6% |
| 6 | 4 | 0.1% |
Other Punctuation
| Value | Count | Frequency (%) |
| @ | 512 | 79.3% |
| . | 51 | 7.9% |
| / | 45 | 7.0% |
| # | 28 | 4.3% |
| \ | 10 | 1.5% |
Open Punctuation
| Value | Count | Frequency (%) |
| ( | 1958 | 81.4% |
| [ | 447 | 18.6% |
Close Punctuation
| Value | Count | Frequency (%) |
| ) | 1958 | 81.4% |
| ] | 447 | 18.6% |
Math Symbol
| Value | Count | Frequency (%) |
| = | 598 | 96.9% |
| + | 19 | 3.1% |
Dash Punctuation
| Value | Count | Frequency (%) |
| - | 175 | 100.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 11808 | 56.5% |
| Common | 9086 | 43.5% |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
| ( | 1958 | 21.5% |
| ) | 1958 | 21.5% |
| 1 | 1180 | 13.0% |
| 2 | 968 | 10.7% |
| = | 598 | 6.6% |
| @ | 512 | 5.6% |
| 3 | 474 | 5.2% |
| [ | 447 | 4.9% |
| ] | 447 | 4.9% |
| - | 175 | 1.9% |
| Other values (8) | 369 | 4.1% |
Latin
| Value | Count | Frequency (%) |
| c | 5048 | 42.8% |
| C | 3392 | 28.7% |
| O | 1082 | 9.2% |
| N | 727 | 6.2% |
| n | 631 | 5.3% |
| H | 377 | 3.2% |
| F | 244 | 2.1% |
| l | 129 | 1.1% |
| S | 74 | 0.6% |
| s | 40 | 0.3% |
| Other values (5) | 64 | 0.5% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 20894 | 100.0% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| c | 5048 | 24.2% |
| C | 3392 | 16.2% |
| ( | 1958 | 9.4% |
| ) | 1958 | 9.4% |
| 1 | 1180 | 5.6% |
| O | 1082 | 5.2% |
| 2 | 968 | 4.6% |
| N | 727 | 3.5% |
| n | 631 | 3.0% |
| = | 598 | 2.9% |
| Other values (23) | 3352 | 16.0% |
MW
Real number (ℝ≥0)
| Distinct | 405 |
|---|---|
| Distinct (%) | 99.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 374.9159487 |
| Minimum | 89.094 |
| Maximum | 923.449 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | 89.094 |
|---|---|
| 5-th percentile | 176.9946 |
| Q1 | 296.414 |
| median | 369.343 |
| Q3 | 452.467 |
| 95-th percentile | 583.3322 |
| Maximum | 923.449 |
| Range | 834.355 |
| Interquartile range (IQR) | 156.053 |
Descriptive statistics
| Standard deviation | 124.9275481 |
|---|---|
| Coefficient of variation (CV) | 0.333214814 |
| Kurtosis | 1.118962417 |
| Mean | 374.9159487 |
| Median Absolute Deviation (MAD) | 78.177 |
| Skewness | 0.4405301814 |
| Sum | 153340.623 |
| Variance | 15606.89227 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 369.343 | 2 | 0.5% |
| 321.38 | 2 | 0.5% |
| 163.173 | 2 | 0.5% |
| 404.462 | 2 | 0.5% |
| 516.667 | 1 | 0.2% |
| 362.363 | 1 | 0.2% |
| 308.337 | 1 | 0.2% |
| 440.367 | 1 | 0.2% |
| 337.375 | 1 | 0.2% |
| 520.637 | 1 | 0.2% |
| Other values (395) | 395 | 96.6% |
| Value | Count | Frequency (%) |
| 89.094 | 1 | 0.2% |
| 96.133 | 1 | 0.2% |
| 114.104 | 1 | 0.2% |
| 115.132 | 1 | 0.2% |
| 116.12 | 1 | 0.2% |
| 117.148 | 1 | 0.2% |
| 126.111 | 1 | 0.2% |
| 127.168 | 1 | 0.2% |
| 129.115 | 1 | 0.2% |
| 136.154 | 1 | 0.2% |
| Value | Count | Frequency (%) |
| 923.449 | 1 | 0.2% |
| 862.746 | 1 | 0.2% |
| 789.099 | 1 | 0.2% |
| 785.025 | 1 | 0.2% |
| 669.777 | 1 | 0.2% |
| 658.707 | 1 | 0.2% |
| 637.744 | 1 | 0.2% |
| 635.942 | 1 | 0.2% |
| 622.84 | 1 | 0.2% |
| 621.915 | 1 | 0.2% |
pIC50
Real number (ℝ≥0)
| Distinct | 273 |
|---|---|
| Distinct (%) | 66.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 6.355827231 |
| Minimum | 2.065501549 |
| Maximum | 10.1426675 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 3.3 KiB |
Quantile statistics
| Minimum | 2.065501549 |
|---|---|
| 5-th percentile | 3.862782522 |
| Q1 | 5 |
| median | 6.522878745 |
| Q3 | 7.657577319 |
| 95-th percentile | 8.821607809 |
| Maximum | 10.1426675 |
| Range | 8.077165955 |
| Interquartile range (IQR) | 2.657577319 |
Descriptive statistics
| Standard deviation | 1.65184478 |
|---|---|
| Coefficient of variation (CV) | 0.2598945378 |
| Kurtosis | -0.7937361524 |
| Mean | 6.355827231 |
| Median Absolute Deviation (MAD) | 1.363177902 |
| Skewness | -0.09665965835 |
| Sum | 2599.533337 |
| Variance | 2.728591178 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 4 | 14 | 3.4% |
| 5 | 14 | 3.4% |
| 7 | 6 | 1.5% |
| 8.096910013 | 6 | 1.5% |
| 6 | 5 | 1.2% |
| 8 | 5 | 1.2% |
| 7.397940009 | 5 | 1.2% |
| 8.522878745 | 5 | 1.2% |
| 7.795880017 | 5 | 1.2% |
| 8.045757491 | 5 | 1.2% |
| Other values (263) | 339 | 82.9% |
| Value | Count | Frequency (%) |
| 2.065501549 | 1 | 0.2% |
| 2.301029996 | 2 | 0.5% |
| 2.460000001 | 1 | 0.2% |
| 2.823908741 | 1 | 0.2% |
| 3 | 1 | 0.2% |
| 3.086186148 | 1 | 0.2% |
| 3.301029996 | 1 | 0.2% |
| 3.406160339 | 1 | 0.2% |
| 3.493494968 | 1 | 0.2% |
| 3.494850022 | 1 | 0.2% |
| Value | Count | Frequency (%) |
| 10.1426675 | 1 | 0.2% |
| 10.04575749 | 1 | 0.2% |
| 9.920818754 | 1 | 0.2% |
| 9.886056648 | 2 | 0.5% |
| 9.522878745 | 1 | 0.2% |
| 9.397940009 | 2 | 0.5% |
| 9.301029996 | 3 | 0.7% |
| 9.299988938 | 1 | 0.2% |
| 9.180456064 | 1 | 0.2% |
| 9.113509275 | 1 | 0.2% |
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
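The calculation described above can be sketched in plain Python. This is an illustrative implementation of the textbook formula, not the code the report itself runs; the function name `pearson_r` is hypothetical, and the sample values are the HBA and HBD entries of the first five rows shown under "First rows".

```python
import math


def pearson_r(xs, ys):
    """Covariance of X and Y divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# HBA and HBD values of the first five sample rows.
hba = [3, 6, 5, 4, 2]
hbd = [2, 2, 0, 2, 2]
print(pearson_r(hba, hbd))

# Shifting and rescaling one variable leaves r unchanged.
print(pearson_r([2 * x + 5 for x in hba], hbd))
```

The same 1/n factor appears in the covariance and in both standard deviations, so it cancels; using n - 1 throughout would give the same r.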
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
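A minimal sketch of that recipe: rank both variables, then apply the Pearson formula to the ranks. The helper names `average_ranks`, `pearson_r`, and `spearman_rho` are hypothetical, and giving tied values the mean of their positions is an assumption (it is the common convention, but the text above does not specify tie handling).

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Pearson's r computed on the rank variables of X and Y."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotonic but nonlinear relationship: rho is 1, Pearson's r is below 1.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(spearman_rho(xs, ys), pearson_r(xs, ys))
```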
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
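The pair-counting definition above can be sketched directly. This is the plain form of the formula (often called tau-a), in which tied pairs count as neither concordant nor discordant; library implementations such as SciPy's `kendalltau` default to a tie-adjusted variant (tau-b), so results can differ when ties are present. The function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:      # both variables move the same way
            concordant += 1
        elif direction < 0:    # the variables move in opposite ways
            discordant += 1
        # direction == 0 is a tie and contributes to neither count
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# One swapped pair out of six: 5 concordant, 1 discordant.
print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))
```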
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
First rows
| | df_index | ChEMBL ID | AlogP | PSA | HBA | HBD | Smiles | MW | pIC50 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | CHEMBL394875 | -0.24 | 63.32 | 3 | 2 | CSC[C@H](N)C(=O)O | 352.397 | 4.200659 |
| 1 | 4 | CHEMBL200381 | 5.04 | 86.52 | 6 | 2 | N#Cc1cnc2cnc(NCc3cccnc3)cc2c1Nc1ccc(F)c(Cl)c1 | 151.121 | 7.301030 |
| 2 | 7 | CHEMBL502351 | 4.62 | 52.31 | 5 | 0 | COc1ccc(-c2cnc3c(-c4cccc5ncccc45)cnn3c2)cc1 | 380.806 | 5.522879 |
| 3 | 8 | CHEMBL492572 | 4.59 | 70.13 | 4 | 2 | C[C@@H](CN1CCC(n2c(=O)[nH]c3cc(Cl)ccc32)CC1)NC(=O)c1ccc2ccccc2c1 | 603.601 | 7.337242 |
| 4 | 9 | CHEMBL492591 | 1.46 | 66.23 | 2 | 2 | O=C(O)c1cc2occc2[nH]1 | 337.261 | 6.850781 |
| 5 | 10 | CHEMBL494772 | 2.37 | 96.28 | 6 | 2 | COc1cc(Cc2cnc(N)nc2N)c(C(C)C)cc1OC | 465.554 | 5.700001 |
| 6 | 11 | CHEMBL561103 | 3.74 | 110.53 | 10 | 1 | COC(=O)Nc1ccc(-c2nc(N3CCOCC3)c3cnn(C4CCN(Cc5cccnc5)CC4)c3n2)cc1 | 509.393 | 8.337242 |
| 7 | 12 | CHEMBL559525 | 4.69 | 58.20 | 2 | 2 | O=C(NC(=O)c1c(F)cccc1Cl)NC1c2ccccc2-c2ccccc21 | 188.231 | 8.818156 |
| 8 | 14 | CHEMBL397983 | 5.81 | 99.44 | 8 | 0 | CCOc1ccc(-n2c([C@@H](C)N(Cc3cccnc3)C(=O)Cc3ccc(OC(F)(F)F)cc3)nc3ncccc3c2=O)cc1 | 387.867 | 8.096910 |
| 9 | 15 | CHEMBL1096283 | 1.06 | 82.67 | 7 | 0 | Cn1nc(-c2ccc(C(F)(F)F)cc2)nc2c(=O)n(C)c(=O)nc1-2 | 396.672 | 3.698970 |
Last rows
| | df_index | ChEMBL ID | AlogP | PSA | HBA | HBD | Smiles | MW | pIC50 |
|---|---|---|---|---|---|---|---|---|---|
| 399 | 969 | CHEMBL539313 | 1.77 | 33.62 | 3 | 1 | CCC1(C2=NCCN2)Cc2ccccc2O1.Cl | 189.127 | 5.200000 |
| 400 | 971 | CHEMBL2430359 | 3.14 | 129.21 | 8 | 3 | CN(c1ncccc1CNc1nc(Nc2ccc3c(c2)CC(=O)N3)ncc1C(F)(F)F)S(C)(=O)=O.O=S(=O)(O)c1ccccc1 | 214.648 | 7.853872 |
| 401 | 972 | CHEMBL592374 | 2.85 | 12.03 | 1 | 1 | Clc1ccc([C@]23CNC[C@H]2C3)cc1Cl | 350.447 | 7.292430 |
| 402 | 973 | CHEMBL600764 | 3.35 | 65.79 | 4 | 1 | O=C(CN1CCN(C(=O)c2ccco2)CC1)Nc1cc(C(F)(F)F)ccc1Cl | 300.266 | 5.000000 |
| 403 | 975 | CHEMBL583042 | 5.43 | 102.93 | 5 | 4 | Cc1cnc(Nc2ccc(F)cc2Cl)nc1-c1c[nH]c(C(=O)N[C@H](CO)c2cccc(Cl)c2)c1 | 497.599 | 7.318759 |
| 404 | 977 | CHEMBL327002 | 5.37 | 138.18 | 7 | 3 | Cc1cc(C)c(N(Cc2ccccc2)S(=O)(=O)c2ccc(OCCNC(=O)c3cc4ccccc4o3)cc2)c(C(=O)NO)c1 | 469.574 | 7.769551 |
| 405 | 979 | CHEMBL254760 | 5.68 | 81.07 | 7 | 2 | Cn1cnc(-c2cc3nccc(Oc4ccc(NC(=S)NC(=O)Cc5ccccc5)cc4F)c3s2)c1 | 250.294 | 7.537602 |
| 406 | 980 | CHEMBL539507 | 3.33 | 12.47 | 2 | 0 | C#CCN(C)CCCOc1ccc(Cl)cc1Cl.Cl | 163.173 | 4.301030 |
| 407 | 983 | CHEMBL1553629 | 5.35 | 48.13 | 2 | 2 | CCc1c(C(=O)NCCc2ccc(N3CCCCC3)cc2)[nH]c2ccc(Cl)cc12 | 151.209 | 5.000000 |
| 408 | 988 | CHEMBL473159 | 0.80 | 60.69 | 3 | 3 | Oc1cc(O)cc(O)c1 | 367.788 | 3.406160 |